Table of Contents

Submitted by:

1. Nardine Yousry Kamel Megali 18P6427

--I implemented Naive Bayes from scratch as a bonus--

Header files included

Loading the dataset:

Our dataset contains 863 records with 15 features.

Preprocessing

Correlation Analysis

Correlation is a fundamental property of our variables. It is a general-purpose measurement that can be taken before any modeling. The higher the correlation between the target variable and the predictor variables, the better the expected performance.
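A minimal sketch of this correlation check, assuming the data sits in a pandas DataFrame with a Result target column (the column names and values below are illustrative stand-ins, not the real dataset):

```python
import pandas as pd

# Toy stand-in for the dataset; column names and values are illustrative only.
df = pd.DataFrame({
    "Age":      [34, 61, 45, 70, 29, 55],
    "symptom1": [1, 0, 1, 1, 0, 1],
    "Result":   [0, 1, 0, 1, 0, 1],
})

# Pairwise Pearson correlations; the Result column shows how strongly
# each predictor tracks the outcome.
corr = df.corr()
print(corr["Result"].sort_values(ascending=False))
```

The same matrix is what a seaborn heatmap would visualise for the heatmap graphs mentioned below.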

Splitting the data so that the target is in y and X holds all the other columns

Visualising the dataset

1. Plotting the data

2. Age VS Result: since the correlation from the heatmap graph was very high

3. symptom1 VS symptom2: since the correlation from the heatmap graph was very high

4. symptom2 VS symptom3: since the correlation from the heatmap graph was very high

5. symptom4 VS symptom3: since the correlation from the heatmap graph was very high

Conclusion

From the scatter plots, there is no strong correlation between any of the symptoms.

Therefore, we can't drop any columns!

To get a better view of the features' distributions

Removing symptom5 & symptom6 since each holds only a single constant value
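Constant columns like these can be detected and dropped in one pass; a sketch, again with illustrative column names rather than the real frame:

```python
import pandas as pd

# Illustrative frame: symptom5/symptom6 are constant, so they carry no signal.
df = pd.DataFrame({
    "symptom4": [1, 0, 1, 0],
    "symptom5": [1, 1, 1, 1],
    "symptom6": [0, 0, 0, 0],
})

# A column with a single distinct value cannot help any classifier.
constant_cols = [c for c in df.columns if df[c].nunique() <= 1]
df = df.drop(columns=constant_cols)
print(constant_cols)  # the columns that were removed
```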

PHASE_1 Classification:

The data used in this project will help to identify whether a person is going to recover from coronavirus symptoms or not based on some pre-defined standard symptoms.

In the first milestone of the project we built the KNN, Naive Bayes, and Logistic Regression classifiers to predict whether somebody will recover or die based on the given dataset.

For every classifier we try different preprocessing on the data, and we compare the results to find the best model of each classifier.

--> I reload the dataset every time to make sure it is the original data, without any normalization or any previous model affecting it.

1. K-Nearest-Neighbour Classifier

In this classifier we will try different models:

  1. One-Hot Encoding

  2. Normalisation

  3. Normalisation with One-Hot Encoding

and see which works best, with the highest ROC-AUC as well as F2-score.

A. With One-Hot-Encoding

Splitting Data Train/Test:

Hyperparameter Tuning

Plotting Recall and F-Beta at different Values of K:
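A hedged sketch of this tuning loop, on synthetic data rather than the project dataset: scan candidate values of K, score each on a held-out split with the F2-score, and keep the best.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import fbeta_score

# Synthetic stand-in for the real data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scores = {}
for k in range(1, 16, 2):  # odd K avoids voting ties in binary problems
    pred = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).predict(X_te)
    scores[k] = fbeta_score(y_te, pred, beta=2)  # F2 weights recall over precision

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Plotting `scores` against K gives the recall/F-beta curves described here.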

Printing the k

Knn classifier with k neighbors

Fitting the Model

Get accuracy. Note: for classification algorithms, the score method returns accuracy.

Confusion_matrix

A table used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. Scikit-learn provides the confusion_matrix function to calculate it.
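For a binary problem the matrix has the layout [[TN, FP], [FN, TP]]; a small self-contained example with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions for illustration.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)  # [[3 1]
           #  [1 3]]
```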

Classification Report

Another important report is the classification report: a text summary of the precision, recall, and F1-score for each class. Scikit-learn provides the classification_report function to calculate it.
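Using the same made-up labels as in the confusion-matrix illustration:

```python
from sklearn.metrics import classification_report

# Made-up true labels and predictions for illustration.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# One line of precision / recall / F1 per class, plus averages.
print(classification_report(y_true, y_pred))
```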

ROC (Receiver Operating Characteristic)

It is a plot of the true positive rate against the false positive rate for the different possible cutpoints of a diagnostic test.

An ROC curve demonstrates several things:

1) It shows the trade-off between sensitivity and specificity.

2) The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test (the larger the area under the curve, the better).

3) The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.

4) The area under the curve is a measure of test accuracy.
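The curve and its area come straight from scikit-learn; a tiny example with made-up probability scores:

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Made-up labels and predicted probabilities for the positive class.
y_true  = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# TPR vs FPR at every possible cut-point; AUC summarises the whole curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```

Plotting `fpr` against `tpr` produces the ROC curves shown for each model.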

B. With Normalization

Splitting Data Train/Test

Hyperparameter Tuning

Plotting Recall and F-Beta at different Values of K:

Printing the k

Knn classifier with k neighbors:

Fitting the Model

Get accuracy. Note: for classification algorithms, the score method returns accuracy.

Confusion_matrix

Classification Report

ROC (Receiver Operating Characteristic)

C. Normalization and One-Hot-Encoding

Splitting Data Train/Test:

Hyperparameter Tuning

Plotting Recall and F-Beta at different Values of K:

Printing the k

KNN classifier with k neighbors

Fitting the Model

Get accuracy. Note: for classification algorithms, the score method returns accuracy.

Confusion_matrix

Classification Report

ROC (Receiver Operating Characteristic)

Comparison between the 3 models with respect to ROC-AUC curve

Comparison between the 3 models with respect to F2

Comparison between the 3 models with respect to Precision-Recall AUC

KNN Conclusion

Based on the previous graphs, comparing by F-Beta and ROC, the best model is the one With Normalization Only.


Naïve Bayes Classifier

We use the Gaussian NB classifier. We first split the data into train, validation and test sets of 70%, 15% and 15% respectively.
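One common way to get a 70/15/15 split is to apply train_test_split twice: first carve off 30%, then halve it. A sketch with synthetic arrays standing in for the real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 100 samples, 2 features, alternating labels.
X = np.arange(200).reshape(100, 2)
y = np.tile([0, 1], 50)

# 70% train, then split the remaining 30% evenly into validation and test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=0)
print(len(X_train), len(X_val), len(X_test))  # 70 15 15
```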

A. Without Hyperparameter Tuning

Splitting Train/Test:

Model Training

Confusion Matrix

Classification Report

B. With Hyperparameter Tuning

Gaussian NB has two hyperparameters: priors and var_smoothing.

  1. priors: the prior probabilities of the classes; setting them manually makes the model prefer one class over the others
  2. var_smoothing: a portion of the largest feature variance that is added to all variances for numerical stability
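Both hyperparameters can be searched with GridSearchCV; a hedged sketch on synthetic data (the grid values below are illustrative choices, not the notebook's exact grid):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the real data.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# priors fixes the class prior instead of estimating it from the data;
# var_smoothing adds a fraction of the largest variance to all variances.
grid = GridSearchCV(
    GaussianNB(),
    param_grid={
        "var_smoothing": np.logspace(-9, -1, 5),
        "priors": [None, [0.3, 0.7], [0.5, 0.5]],
    },
    scoring="recall",   # recall matters most for a medical dataset
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```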

Splitting Train/Test:

Validation to Get Best Hyperparameters:

Training the Model with the best hyperparameter:

Confusion Matrix

Classification Report

Comparison between the 2 models with respect to ROC-AUC curve

Comparison between the 2 models with respect to F2

Comparison between the 2 models with respect to Precision-Recall AUC

Naïve Bayes Conclusion

Based on the previous graphs, comparing by F-Beta, the best model is the one With Hyperparameter Tuning.


Implementation of Naïve Bayes

-EXTRA PART-


Logistic Regression Classifier

We first split the data into train, validation and test sets of 70%, 15% and 15% respectively.

A. Without Hyperparameters Tuning

Splitting Data Train/Test:

Now we will use the StandardScaler to scale the features before running GridSearchCV.
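One hedged way to combine the two, on synthetic data: wrap the scaler and the estimator in a Pipeline, so that GridSearchCV refits the scaler inside each cross-validation fold and no scaling statistics leak from the held-out fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Scaling lives inside the pipeline, so each CV fold gets its own
# mean/std estimated only from that fold's training portion.
pipe = Pipeline([("scale", StandardScaler()),
                 ("lr", LogisticRegression(max_iter=1000))])
grid = GridSearchCV(pipe, {"lr__C": [0.01, 0.1, 1, 10]}, cv=5, scoring="f1")
grid.fit(X, y)
print(grid.best_params_["lr__C"])
```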

Training the Model

Confusion_matrix

Classification Report

ROC (Receiver Operating Characteristic)

B. With Hyperparameter Tuning:

Validation to Get the Best Hyperparameters

Splitting Data Train/Test

Training the Model

Confusion Matrix

Classification Report

C. One-Hot-Encoding

Splitting Data Train/Test

Validation to get the best hyperparameter:

Training the Model

Confusion Matrix

Classification Report

Comparison between the 3 models with respect to ROC-AUC curve

Comparison between the 3 models with respect to F2

Comparison between the 3 models with respect to Precision-Recall AUC

Conclusion:

The One-Hot-Encoding model affects the LR, as shown in the previous graphs. class_weight: the “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data, as n_samples / (n_classes * np.bincount(y)).
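That formula can be checked directly with scikit-learn's compute_class_weight; an illustration on made-up imbalanced labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 80 + [1] * 20)   # made-up imbalanced labels

# "balanced" weight per class: n_samples / (n_classes * np.bincount(y))
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(weights)  # [0.625 2.5] -- the minority class is up-weighted

# The same effect, applied automatically during fitting:
clf = LogisticRegression(class_weight="balanced")
```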


PHASE_2 Classification

In the second milestone of the project we built the Decision Tree and SVM classifiers to predict whether somebody will recover or die based on the given dataset.

For every classifier we try different preprocessing on the data, and we compare the results to find the best model of each classifier.

1. Decision Tree Classifier

In this classifier we will try different models:

  1. Without Hyperparameter tuning

  2. With Hyperparameter tuning

and see which works best, with the highest ROC-AUC as well as F2-score.

A. Without Hyperparameter tuning

Splitting Data Train/Test:

Decision Tree Classifier with criterion entropy

Training the Model:

Visualizing the Tree

Confusion Matrix

Classification Report

Decision Tree Classifier with criterion gini index:

Visualizing the Tree

Confusion Matrix

Classification Report

Concluded that: We chose the F2-score as the main comparison metric because, in our medical dataset, we want to minimise the number of false negatives. Under that scoring, entropy and gini index give the same F2-score. However, there is a slight difference between them in recall and precision; that small difference shows entropy is better, since both its recall and precision are high (0.94) and its ROC curve covers a greater area (0.9649).
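The recall-weighting behind that choice is visible directly in fbeta_score; a small illustration on made-up labels where precision and recall differ:

```python
from sklearn.metrics import fbeta_score

# Made-up labels: TP=3, FN=1, FP=2 -> precision 0.6, recall 0.75.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# beta=2 weights recall twice as heavily as precision, which suits a
# medical setting where a false negative (missed sick patient) is costliest.
f2 = fbeta_score(y_true, y_pred, beta=2)
f1 = fbeta_score(y_true, y_pred, beta=1)
print(round(f2, 3), round(f1, 3))  # 0.714 0.667 -- F2 > F1 since recall > precision
```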

B. With Hyperparameter tuning

Training the Model with the best hyperparameter:

Confusion Matrix

Classification Report

ROC (Receiver Operating Characteristic)

Comparison between the 2 models with respect to ROC-AUC curve

For the model without hyperparameter tuning we compare using entropy, since it was already shown that entropy has better recall and precision as well as ROC (mentioned above).

Comparison between the 2 models with respect to F2


2. Support Vector Machine

We will try the classifier once without hyperparameter tuning and once with hyperparameter tuning.

The hyperparameters are:

  1. Kernels: The main function of the kernel is to take a low-dimensional input space and transform it into a higher-dimensional space. It is mostly useful in non-linear separation problems.

  2. C (Regularisation): C is the penalty parameter, which represents the misclassification or error term. It tells the SVM optimisation how much error is bearable; this is how you control the trade-off between the decision boundary and the misclassification term.

  3. Gamma: It defines how far the influence of a single training example reaches in the calculation of the separating line. When gamma is high, only nearby points have high influence; a low gamma means far-away points are also considered when forming the decision boundary.
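These three hyperparameters can be searched jointly with GridSearchCV; a hedged sketch on synthetic data (the grid values are illustrative, not the notebook's exact grid; gamma is simply ignored by the linear kernel):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the real data.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

grid = GridSearchCV(
    SVC(),
    param_grid={
        "kernel": ["linear", "rbf"],
        "C": [0.1, 1, 10],              # misclassification penalty
        "gamma": ["scale", 0.01, 0.1],  # reach of a single training point
    },
    cv=5,
    scoring="recall",   # recall matters most for a medical dataset
)
grid.fit(X, y)
print(grid.best_params_)
```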

A. Without Hyperparameters Tuning

Splitting Data Train/Test:

Now we will use the StandardScaler to scale the features before running GridSearchCV.

Training the Model:

Confusion_matrix

Classification Report

ROC (Receiver Operating Characteristic)

B. With Hyperparameters Tuning

Training the Model with the best hyperparameter

Confusion Matrix

Classification Report

ROC (Receiver Operating Characteristic)

Comparison between the 2 models with respect to F2

Evaluation and comparison of all the models